Distinguishing speech from multiple users in a computer interaction

ABSTRACT

Speech from multiple users is distinguished. In one example, an apparatus has a sensor to determine a position of a speaker, a microphone array to receive audio from the speaker and from other simultaneous audio sources, and a processor to select a pre-determined filter based on the determined position and to apply the selected filter to the received audio to separate the audio from the speaker from the audio from the other simultaneous audio sources.

FIELD

The present disclosure relates to speech processing for computer interfaces and, in particular, to distinguishing speech from different computer users.

BACKGROUND

Speech recognition systems are used by automated telephone answering systems, by automobiles for navigation and telephone controls, by computers for commands and dictation, by gaming machines for game play, by televisions for channel selection, and by portable telephones for hands-free command and query systems. In these and many other systems, the user speaks into a microphone and the system analyzes the received audio to determine whether it corresponds to a command or query. The speech recognition may be done on a small local processor or the signals may be sent to a larger server or other centralized system for processing.

Speech recognition systems rely on the microphones and receiving systems for accurately receiving the voice and then for filtering out other noises, such as wind, machinery, other speakers, and other types of noise. For a telephone or a computer gaming system, there may be very little other noise. For a portable telephone or a computer, there may be more ambient noise and other speakers may also be audible. A variety of different noise cancellation systems have been developed to isolate the user's voice from the noise. For portable telephones, two microphones are often used. The main microphone is directed at the speaker and a noise cancellation microphone is pointed in a different direction. The noise cancellation microphone provides the background noise, which is then subtracted from the audio received in the main microphone.

Blind Source Separation (BSS) has been developed to separate the voices of two speakers that speak at the same time. BSS typically uses more complex audio processing than a simple noise cancellation microphone. BSS refers to techniques that extract voices, other audio, or other types of signals that come from different sources out of a mixture of these signals, without using any specific knowledge of the signals, the signal sources, or the positions of the signal sources. BSS requires only that the different sources be statistically independent, which is the case when the sources are voices from different people. One voice may be filtered out, or the two voices may be separated so that both are provided to a speech recognition system. When multiple speakers interact with a system simultaneously, multiple microphones capture the combined speech. BSS is intended to separate the speakers' voices into separate channels by generating “de-mixing” filters such as finite impulse response (FIR) filters. When the filters depend on the voice source locations, BSS requires re-training (i.e., re-calculating the filters) whenever any of the speakers changes location.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like reference numerals refer to similar elements.

FIG. 1 is a diagram of audio signals in an audio environment and apparatus for separating simultaneous audio signals according to an embodiment of the invention.

FIGS. 2A to 2H are simplified graphs of finite impulse response filter values for different audio signals that may be used in the apparatus of FIG. 1 according to an embodiment of the invention.

FIGS. 3A and 3B are front and side views, respectively, of an isolation volume for separating simultaneous audio signals according to an embodiment of the invention.

FIG. 4 is a diagram of training an apparatus to generate finite impulse response filters according to an embodiment of the invention.

FIG. 5 is a process flow diagram of training an apparatus such as that of FIG. 4 according to an embodiment of the invention.

FIG. 6 is a diagram of a voice recognition device after training according to an embodiment of the invention.

FIG. 7 is a process flow diagram of selecting a filter for a device such as that of FIG. 6 according to an embodiment of the invention.

FIG. 8 is a diagram of using a fixed voice recognition device according to an embodiment of the invention.

FIG. 9 is a diagram of using a mobile voice recognition device according to an embodiment of the invention.

FIG. 10 is a block diagram of a computing device incorporating a voice recognition device according to an embodiment.

DETAILED DESCRIPTION

A Blind Source Separation (BSS) technique may be used to provide a speech interface to any of a variety of different devices including computers, gaming machines, televisions, and telephones, both fixed and mobile, among others. BSS may be used to separate the simultaneous voice-based interaction of two or more users by first separating the speech from the different users and then transcribing the speech through an automatic recognition engine. The speech may be recorded, transmitted, used as a command, or applied to a variety of other purposes.

BSS refers to many different techniques that distinguish audio from different sources without any knowledge of the audio source. BSS generally relies on an assumption that the characteristics of the audio from the different sources are statistically independent or statistically uncorrelated. These techniques are able to distinguish different sources; however, some processing is required to recognize, analyze, and distinguish the audio sources. Since audio tends to reflect from surfaces and then interfere with other reflections and with the original source, in any particular environment the audio received from any particular source may change if the source moves. In the case of speakers that are moving around a room, for example, the statistical characteristics of the audio received from each speaker change so quickly that it is difficult to separate the different moving sources without a significant delay.

BSS normally requires re-training each time a signal source moves. This re-training causes some delay between when the speech is received and when it can be extracted. The delay may obscure the first part of the speech or postpone processing until after the system is re-trained. For a speaker that moves frequently, the delay may render the system impractical. For higher quality separation, more complex filters are used, which require even more accurate and frequent re-training. The complex and frequent re-training also requires significant processing resources.

To eliminate the delay and computational load, a BSS system may be initially trained with a single generalized voice for multiple locations within a space. During this initial training, the delay between multiple microphones may be forced to zero and a set of de-mixing filters may be generated for each position. Then, different positions in space may be emulated by varying the transfer functions of the de-mixing filters corresponding to each different position. The different sets of transfer functions are stored. An appropriate filter transfer function is then selected for use in the BSS based on the position of the desired speaker. Using the stored transfer functions, BSS and similar techniques may be used without additional training. Users are able to move around while interacting with the system. This allows for simultaneous multiple-user speech recognition.

FIG. 1 is a diagram of separating signals from two different speakers using blind signal separation or a similar technique. In any location 102, sound h11, h22 from many different sources s1, s2 is mixed by the environment. The mixed signals captured by the array of microphones x1, x2 of a system 104 include sound from more than one source. The sound includes sound received directly from a source h11, h22 and sound received indirectly from a source h21, h12. The indirect sound may come from indirect propagation h21, h12, from reflections, from echoes, and from resonance in the environment 102.

Using the input from this array x1, x2, the system 104 generates de-mixing filters 106 that are applied to the mixed signals. The filters are typically, but not necessarily, FIR (finite impulse response) filters w11, w12, w21, w22, which generate final output signals y1, y2. The system output interface 108 supplies these de-mixed signals y1 and y2, corresponding to sources s1 and s2 respectively, to other processes. In FIG. 1, these other processes are a command interface 110, which interprets the sound as a spoken command and provides the command to a CPU 112 for execution; however, the invention is not so limited.

The term “de-mixing” filter refers generally to any of a variety of different types of filters that may be used to separate one source from other sources and from ambient noise. For BSS, the de-mixing filters are typically finite impulse response (FIR) filters; however, the invention is not so limited. A fast Fourier transform (FFT) is performed on the received audio. The resulting frequency domain signal is then applied to the FIR filter, for example by convolving the signal with the filter. An inverse FFT is applied to the filtered signal, and the separated audio is accordingly recovered in the time domain for further processing. Such a process may be combined with other processing to obtain even better results or to improve the audio quality. The specific nature of the FIR filter for each situation and condition is typically determined empirically for each audio environment through training; however, the invention is not so limited.
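
As a rough illustration of the pipeline just described, the following sketch (hypothetical; it assumes NumPy and placeholder filter taps that are not part of this disclosure) applies an FIR filter to one channel in the frequency domain: FFT, multiplication by the filter's transfer function, and inverse FFT.

```python
import numpy as np

def apply_fir_frequency_domain(x, w):
    """Apply FIR filter taps w to signal x via FFT multiplication.

    Zero-padding to len(x) + len(w) - 1 makes the circular
    convolution equal to the linear convolution, so the result
    matches np.convolve(x, w) up to numerical precision.
    """
    n = len(x) + len(w) - 1
    X = np.fft.rfft(x, n)          # received audio -> frequency domain
    W = np.fft.rfft(w, n)          # FIR taps -> filter transfer function
    return np.fft.irfft(X * W, n)  # inverse FFT recovers the time domain

# Example: a toy 8-tap filter applied to one second of audio at 16 kHz.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)     # stand-in for a microphone signal
w = rng.standard_normal(8)         # stand-in for trained de-mixing taps
y = apply_fir_frequency_domain(x, w)
assert np.allclose(y, np.convolve(x, w))
```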

The de-mixing filters are generated using a training, tuning, or calibration process. The filters depend on the position of the microphones x1, x2, the environment 102, and the position of the sound sources s1, s2. When the sources change positions, the mixing environment changes, and new filters w11, w21, w12, w22 are used. If the sound source is a human user, then the user is very likely to change position frequently and the system 104 retrains frequently.

FIGS. 2A to 2H show simplified graphs of FIR values for different signals identified in FIG. 1. The vertical axis shows amplitude against a horizontal time axis. FIG. 2A shows an amplitude impulse h11 generated by the first source s1 as received at the first microphone x1. The impulse is positioned at about 125 on the time scale with an amplitude of 0.8. FIG. 2B shows the same impulse h12 generated by the same source s1 as it is received at the second microphone x2. The amplitude pulse has the same shape but is delayed to a time of about 150 and attenuated to about 0.7. Similarly, FIG. 2D shows the amplitude impulse h22 generated by the second source s2 as received at the second microphone x2. FIG. 2C shows this signal h21 from the second source s2 as received at the first microphone x1 with a delay and an attenuation.

The signals are mixed by the ambience of the environment and received in this mixed condition at the receiving microphones x1, x2. They may also be mixed with echoes, noise, resonances, and other signal sources not shown here in order to simplify the diagram. FIG. 2E is an example of a FIR filter w11 that would be applied to the signal received by the first microphone based on the original source signal h11. FIG. 2G is an example of a filter signal w12 that would be applied to the second source signal as received by the first microphone h12. By applying these signals to the first microphone signal, the signal from the first source is enhanced and the signal from the second source is attenuated. Similarly, FIG. 2F is an example of a filter signal w21 that would be applied to the first source signal as received by the second microphone h21, and FIG. 2H is an example of a filter signal w22 that would be applied to the second source signal as received by the second microphone h22. Similar filter signals may be generated for echoes, noise, resonances, and other signal sources. These may be combined with the illustrated filter signals.

Since the signals at the microphones are mixed, the filter signals are applied to the mixed signals and not to single isolated signals. The first microphone signal may be processed using two filter signals w11, w12, or all four filter signals may be used. The result is the enhancement of one received sound signal and the suppression of all others.
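
A minimal sketch of this combination step, assuming two equal-length microphone channels and four sets of FIR taps as NumPy arrays (the names w11 through w22 follow FIG. 1, but the implementation itself is illustrative, not the disclosed one):

```python
import numpy as np

def demix_two_channels(x1, x2, w11, w12, w21, w22):
    """Recover two source estimates from two mixed microphone signals.

    Each output combines *both* microphone signals, because each
    microphone hears both sources: y1 enhances source 1 while
    suppressing source 2, and y2 does the reverse. The channels
    are assumed to have the same length and sample rate.
    """
    y1 = np.convolve(x1, w11)[:len(x1)] + np.convolve(x2, w12)[:len(x2)]
    y2 = np.convolve(x1, w21)[:len(x1)] + np.convolve(x2, w22)[:len(x2)]
    return y1, y2
```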

In establishing the filters, the specific parameters and the nature of the filters are selected so that the speech recognition of the command interface is successfully able to recognize the spoken commands. The required amount of separation determines the complexity of the training, the accuracy of the filters, and the required precision for each speaker's location. The position of a speaker may be identified as being within a certain range of positions. When the speaker is within a specific range of a particular central position, then the same filter may be used. When the speaker moves too far from the central position, then a different filter is used in order to maintain sufficient separation.

The range of positions around a particular central position that use the same filter parameters is referred to herein as a separation bubble. The separation bubbles determine the range of movement allowed for each set of filter parameters. Typically, adjacent bubbles will overlap so that, at the edge of two bubbles, similar results are obtained using either the filters for one of the bubbles or the filters for the other. The maximum size of a separation bubble is determined at least in part by the required amount of separation. The bubbles may also change in size if there is a change in the amount of diffuse background noise and for different rooms.

A BSS filter set established, for example, by training with an audio source, such as a speaker, at a certain position reduces its separation performance as the speaker moves away from the original trained position. FIG. 3A is a diagram of a front view of a separation volume, or separation bubble 304, surrounding a source 302 in which BSS performance is acceptable. As mentioned above, the amount of separation that is to be considered acceptable depends upon the particular use of the audio signals. Less separation may be required to distinguish between a limited set of commands than would be required to transcribe spoken words into text. In the example of FIG. 3A, a dynamic loudspeaker with voice coil drivers is shown as an example of the speaker. However, any other type of speaker that is able to provide repeatable results may be used. In some cases, “speaker” is used herein to refer to an electrical device and in other cases it is used to refer to a person that is speaking.

In some voice recognition tests, if the BSS routine can separate out the desired audio signal by a separation factor of 70%, this may be enough to achieve acceptable voice recognition. 70% corresponds to the area shown as the inner bubble 304. A signal produced anywhere in the inner bubble will provide at least 70% separation for the desired speaker using filters that are based on a signal at the center of the inner bubble. The inner bubble has the volume of an ellipsoid. FIG. 3A shows a circular cross-section as viewed from the front, from the perspective of the microphones. FIG. 3B shows the same inner bubble 304 from the side and shows that the bubble is taller than it is deep, forming an ellipse from this view. The bubble has the shape of an ellipsoid, with the shorter radius in the depth direction as shown on the page.

For less demanding applications, a larger bubble may be used. FIGS. 3A and 3B show a larger central bubble 306 which provides a separation of at least 50% using the same filter trained on the center of the bubbles. Similarly, for even less demanding applications, some lesser amount of separation may be required. An outer bubble 308 is also elliptical as viewed from the side and represents separations that range from 50% at its inner edge to 0% at its outer edge. In other words, when the source is positioned at the outer edge of the outer bubble and the signals are filtered based on the source being at the center of the bubble, the system is unable to separate the signals at all. Depending on the application for the audio signals, if the source moves too far from the center, then a new set of filters is required. The new set of filters will correspond to a new bubble next to, and perhaps partially overlapping with, the bubble shown in FIG. 3A. As an example, the 50% bubble of the neighboring bubble may abut, adjoin, or nearly abut the 50% bubble 306 of the illustrated bubble. For better separation of the signals, the 70% bubbles may abut each other so that the 50% bubbles completely overlap on one side.
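
For illustration, a membership test for an ellipsoidal bubble might look like the following sketch. The helper name in_bubble, the coordinate convention, and the semi-axis values are assumptions made for this example, not values taken from the disclosure.

```python
import numpy as np

def in_bubble(position, center, semi_axes):
    """Return True if a 3-D position lies inside an ellipsoidal bubble.

    A point is inside the ellipsoid when the sum of squared,
    axis-normalized offsets from the center is at most 1.
    """
    p = (np.asarray(position) - np.asarray(center)) / np.asarray(semi_axes)
    return float(np.dot(p, p)) <= 1.0

# Illustrative 70% bubble that is taller and wider than it is deep
# (semi-axes in meters are assumed for this sketch).
center = (0.0, 1.2, 2.0)           # x (lateral), y (height), z (depth)
semi_axes = (0.5, 0.5, 0.3)
print(in_bubble((0.1, 1.3, 2.1), center, semi_axes))  # True
```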

The shape and size of the separation bubble in any particular implementation depends on the microphone positions. The size of the bubble might vary for different microphone positions. Because the bubbles and corresponding filters require some time to generate and vary for different systems and environments, multiple separation bubbles may be generated in advance and stored, for example, in a lookup table. The operation of a BSS system may then be emulated without requiring the delay and processing power that BSS consumes.

A training process is shown in the example of FIG. 4. A computing device 402 is placed near different possible positions for speakers. The voice recognition device may be a computer, an information display terminal, a gaming or entertainment device, a remote communications station, or any other device that is to distinguish between different simultaneous speakers. The device 402 is equipped with multiple microphones, in this case three: 404, 406, 408. The microphones are spaced apart from each other and placed on different faces of the device so that each microphone receives a different audio signal from the ambient environment.

The device is trained to determine filter parameters that correspond to different speaker positions with respect to the device. In the illustrated example, there is a first bubble 410 for which the device is already trained. The device is then trained for a second bubble 412. To this end, a speaker 414 is placed in the center of the new bubble 412. The speaker produces audio 416 that is received by the microphones 404, 406, 408. The acoustic qualities of the environment act on the audio as it propagates to the device. If the audio signal is known, then the received signal can be compared to the known signal to generate filter parameters for the particular location. This may be repeated for as many locations as desired.

The device 402 includes many additional components so that the received audio signals may be put to practical use. These components may include a separation module 422 coupled to the microphones 404, 406, 408. The separation module may have filters for performing a blind source separation, buffers for holding audio signals for analysis, and memory for storing filter parameters, among other components. The separation module may perform other operations to separate the audio signal sources in addition to or instead of blind source separation. The separation module may be coupled to a command interface 424 to interpret the separated audio as a command or some other signal. A CPU 426 is coupled to the command interface to receive the commands and other signals and run appropriate operations or functions in response to the commands. The CPU is coupled to a memory 428 to store programming instructions, temporary values, parameters, and results, and to a display 430 to interact with the user. The display may include a touchscreen to receive user input, or there may be other user input devices, such as buttons, keys, or cameras. The device may have additional components (not shown) to facilitate additional functions, depending on the particular implementation.

FIG. 5 shows a process flow for obtaining filters for the new locations using the system configuration shown in FIG. 4. A system such as the voice recognition device 402 of FIG. 4 first obtains, at 502, transfer functions (TFs) from different locations in a particular environment or setting, for example a room. The TFs may be in the form of digital filters that emulate the transfer of audio from one spatial point to another. In the present example, the first spatial point is the location of the speaker and the second spatial point is at each microphone. For the case of three microphones, there are three TFs for each bubble. The TFs can be obtained using, for example, sweeping chirps, white noise, or any other suitable signal, and comparing the received signal to the known signal that was produced by the speaker.
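
One conventional way such a TF could be estimated is sketched below, under the assumption of a known excitation signal and a NumPy environment. The regularized spectral division shown is a common deconvolution technique, not necessarily the one used in this disclosure.

```python
import numpy as np

def estimate_transfer_function(known, received, eps=1e-8):
    """Estimate a room transfer function from a known test signal.

    Dividing the received spectrum by the known spectrum (with a
    small regularizer to avoid division by near-zero bins) yields
    H such that received ~= known convolved with h. One impulse
    response h is estimated per microphone.
    """
    n = len(received)
    X = np.fft.rfft(known, n)
    Y = np.fft.rfft(received, n)
    H = (Y * np.conj(X)) / (np.abs(X) ** 2 + eps)
    return np.fft.irfft(H, n)      # impulse response in the time domain
```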

After obtaining the TFs, the TFs can be used to filter any audio signal and emulate the effect of that signal traveling from one location to another. Then, at 504, the original known signals from the originally trained bubble are taken. These represent a single true recording. At 506, the signals from these true recordings are forced to zero delay by removing the delay between signals recorded from different locations.

To emulate any new point in space, recorded signals are filtered at 508 with the transfer functions of the new locations from 502, which generates new audio signals that act as if they were recorded at the new location. In other words, by filtering a recorded signal with the three TFs for a particular location, three new audio signals are obtained that are very close to a real signal from that location as it would be received by the microphones. The filtered signals are fed into BSS at 510. Then, de-mixing filters are obtained at 512 using BSS.
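
The emulation of steps 506 to 510 might be glued together as in the sketch below. The helper name emulate_position and the run_bss placeholder are hypothetical; any BSS routine could stand in for the latter.

```python
import numpy as np

def emulate_position(reference_recording, impulse_responses):
    """Synthesize the microphone signals a source at a new position
    would have produced (step 508).

    reference_recording: the zero-delay "true recording" (step 506).
    impulse_responses: one room impulse response per microphone for
    the new position (from step 502).
    """
    n = len(reference_recording)
    return [np.convolve(reference_recording, h)[:n]
            for h in impulse_responses]

# The synthesized channels then replace a real recording as BSS
# training input (step 510), e.g.:
#   demix_filters = run_bss(emulate_position(ref, [h1, h2, h3]))
# where run_bss is a placeholder for the chosen BSS routine.
```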

This approach to obtaining de-mixing filters is equivalent to doing the training for each new position. Many different locations can be trained around the recognizing device. The de-mixing filters are stored at 514 in a position lookup table. From the position table, the filters may be used by the voice recognition device to isolate a speaker at each location. Such a routine generates separation bubbles around the recognition device or microphone array for any position that a user might take.
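
A minimal sketch of such a position lookup table, assuming bubble centers as 3-D coordinates and nearest-center retrieval (the FilterTable class and its methods are illustrative, not part of the disclosure):

```python
import numpy as np

class FilterTable:
    """Position lookup table of trained de-mixing filters (step 514)."""

    def __init__(self):
        self.centers = []   # bubble center coordinates, one per entry
        self.filters = []   # de-mixing filter set trained at that center

    def store(self, center, filter_set):
        self.centers.append(np.asarray(center, dtype=float))
        self.filters.append(filter_set)

    def lookup(self, position):
        """Return the filter set whose bubble center is nearest."""
        d = [np.linalg.norm(np.asarray(position) - c) for c in self.centers]
        return self.filters[int(np.argmin(d))]
```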

The emulated training produces separation levels that are not only sufficient but almost as good as the separation levels produced with real training. Separation levels of 25-30 dB may be obtained in both cases, with little difference in the separation level between emulated and real training. Emulated and real training may be performed with one voice and then used for another, very different voice. As an example, when training is done with a female voice, the same filters may be used with good results for a low voice.

In regular operation, the location of the user will be detected by acoustical means, using cross-correlation for example, or by optical means using cameras. The system will search the lookup table for the de-mixing filters that fit the detected location for that user. These de-mixing filters are then used to separate the user's speech. The system can separate simultaneous speech from multiple users in different bubbles as long as their locations can be determined. The users' movements can be tracked using the location detection. The system can then change the filter selection as each user moves to a different bubble. With a sufficiently precise location determination, the best separation bubble can be determined to perform successful voice separation.
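
As an illustration of the acoustical approach, the lag of the cross-correlation peak between two microphone channels gives the inter-microphone arrival delay, which constrains the speaker's position once the microphone geometry is known. A sketch, assuming NumPy and equal-rate sampled channels:

```python
import numpy as np

def estimate_delay(x1, x2, sample_rate):
    """Estimate the arrival-time difference between two microphones.

    The lag of the cross-correlation peak approximates the delay of
    x2 relative to x1. Zero-padding to the full linear-correlation
    length avoids wrap-around; np.roll centers the zero lag.
    """
    n = len(x1) + len(x2) - 1
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    xcorr = np.fft.irfft(np.conj(X1) * X2, n)
    xcorr = np.roll(xcorr, len(x1) - 1)        # center the zero lag
    lag = np.argmax(xcorr) - (len(x1) - 1)     # lag in samples
    return lag / sample_rate                   # delay in seconds
```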

FIG. 6 is a diagram of a voice recognition device 602 for which training has been completed. There is an originally trained BSS bubble 610 and a surrounding ring of added bubbles 612-A to 612-M. Each of these bubbles may be trained individually, or the emulated training of FIG. 5 may be used to add additional bubbles in new locations. In this example, a speaker 620 in a current location speaks directly to the voice recognition device 602. The speaker's position may be determined using either the microphones 622 or one or more cameras 624. A variety of different location techniques may be used, from delay measurement and triangulation to various depth-of-field techniques through stereo vision or stereo audio. The position of the speaker is determined and then applied to a filter selector. The filter for that location is selected, and the speaker's voice may be distinguished from the voice of another speaker.

The selection of a filter is shown in more detail in the process flow diagram of FIG. 7. In FIG. 7, at 702, user speech is detected. At 704 the user's location is detected using acoustical or optical technology. At 706 the closest separation bubble, for example one of the bubbles 612-A to 612-M of FIG. 6, is selected based on the location. The appropriate de-mixing filter is then selected using that location. At 708 the user's speech is applied to the selected de-mixing filter to separate the speech from other sounds, and at 712 the speech is provided to a downstream device or function. As the user 620 continues working in the vicinity of the voice recognition device 602, the speaker may move to a different position. This may be tracked and detected using the microphones or cameras when the user moves at 710. The location is then detected at 704, and at 706 the closest separation bubble is chosen. The appropriate filters for the bubble are applied and the speech is then separated.
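
Pulling the steps of FIG. 7 together, the runtime selection might be organized as in the sketch below. Every callable here (locate, demix, consume, and the table from the earlier lookup-table sketch) is a hypothetical stand-in for whatever components a given implementation uses.

```python
def separation_loop(mics, table, locate, demix, consume):
    """Runtime flow of FIG. 7 with hypothetical helper callables.

    locate(frame)  -> estimated speaker position (702/704/710)
    table.lookup() -> de-mixing filters for the closest bubble (706)
    demix(...)     -> filtered, separated speech (708)
    consume(...)   -> downstream recognizer or other function (712)
    """
    for frame in mics:                      # stream of multi-channel frames
        position = locate(frame)            # acoustic or optical localization
        filters = table.lookup(position)    # re-selected as the user moves
        consume(demix(frame, filters))
```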

FIG. 8 is an example of how such a voice recognition device may be used in a practical application. In FIG. 8, a voice recognition device is integrated into a large panel computing system 802. The display table 802 includes multiple microphones and may also include cameras, computers, and communication resources. The voice recognition device and voice separation system are incorporated into the computing resources of the display table. A first user 804 generates speech 806 at one location. In the illustrated example, the speech is “show me the closest available conference room.” As shown on the display table, the table responds with displayed text 818, which is “go to room JF-205,” and a displayed arrow indicating the direction of the room. As shown, since the system has determined the location of the speaker, the response may be oriented toward the speaker for easier viewing.

Simultaneously, a second user 814 is also generating speech. In this case the speech is “Do you know how the weather is out there?” This speech 816 is also received by the display table 802, and a weather prediction 808 is displayed on the table. The second user is on the opposite side of the display table. Accordingly, the weather prediction is inverted relative to the conference room text so that it is more easily read by the second user.

Because these two speakers are in different locations, which correspond to different separation bubbles (not shown), the simultaneous speech can be separated and the display table can simultaneously respond to both queries provided by the two users. While voice queries are shown in this example, a wide variety of different commands, instructions, and requests may be provided by the users and acted upon by the display table 802. While a table configuration is shown for the computing device, a wide range of other configurations may be used. The computing device may be on a stand, suspended from a wall, placed vertically, horizontally, or in any other position. There may be multiple displays, and an appropriate one of the displays may be selected based on the user's detected position. The microphones and cameras may be incorporated into the computing system or placed very close nearby, or they may be placed in different locations with a large distance between different microphones or cameras to provide a more accurate estimate of the location.

While the example of FIG. 8 shows an open space with the users and the computing device, typically this space will correspond to a room with walls surrounding the users. The training can be made to take these walls into consideration. Such a system may be used in a wide variety of different open and closed environments, large and small.

FIG. 9 shows an alternative situation in which two people are sitting in a car and talking simultaneously. The same approach as in FIG. 8 may be used in this much smaller closed environment. In this example, one of the users is giving commands to a computing system in the car while the other user is not giving commands to the system but is instead speaking to someone else using a separate portable telephone.

Specifically, a computing system 902 is mounted in a vehicle. A first user 904, in this case the driver, issues a command in spoken form, which in this case is “Can you recalculate my route? There is a detour on Main Street.” The command is received by the in-car computer 902. The computer provides an audio response 908 in the form of a spoken word, “sure.” The computer also provides a visual response 920 in the form of a map of the recalculated route. Simultaneously, a second user 914 in the car is speaking, but not to the computing device.

In this example, the second user 914 is speaking a statement, “Yes, sounds good. Tell him I said ‘hi’, please,” into a personal telephone 918 which is not connected with the car computing device. The two users are sitting in different locations, which correspond to the two front seats of the car. These two locations can easily be distinguished from each other with separate separation bubbles. The computing system, using these two separation bubbles, can separate and distinguish the driver's commands from the passenger's speech. In an actual scenario, the computer system will likely separate the driver's speech from the passenger's speech and the passenger's speech from the driver's speech. The driver's speech is recognized as a command to the computing system while the passenger's speech is not. Therefore the computing system is able to act on the speech of the driver without that speech being obscured by that of the passenger.

In a car, the positions of the speakers are limited to the positions of the seats and to body movements within those seats. The interior environment of the car does not change, and neither the computing system in the car nor the car itself is easily moved. Accordingly, setting up separation bubbles can be done once, before the car is provided to a customer. The in-car computing system may be configured to respond to simultaneous commands from both speakers as in FIG. 8.

The response to the command depends upon the particular command, and any convenient command may be supported. In the context of FIG. 8, the responses may include providing visual or audio information, retrieving and transmitting data, and any other desired action. Such a table may be used for visitors in a lobby, for in-store interaction or point-of-sale, or as a workstation or computing station for productivity applications, among others. The in-car commands of FIG. 9 may include navigation, as shown, and commands to vehicle systems, such as heating, entertainment, and vehicle configurations. In addition, the commands may relate to in-car communications systems. A driver or passenger may lower the temperature, send a text message to another person, dial a telephone number, or listen to a song, among other things, depending on the particular implementation.

FIG. 10 illustrates a computing device 100 in accordance with one implementation of the invention. The computing device 100 houses a system board 2. The board 2 may include a number of components, including but not limited to a processor 4 and at least one communication package 6. The communication package is coupled to one or more antennas 16. The processor 4 is physically and electrically coupled to the board 2.

Depending on its applications, computing device 100 may include other components that may or may not be physically and electrically coupled to the board 2. These other components include, but are not limited to, volatile memory (e.g., DRAM) 8, non-volatile memory (e.g., ROM) 9, flash memory (not shown), a graphics processor 12, a digital signal processor (not shown), a crypto processor (not shown), a chipset 14, an antenna 16, a display 18 such as a touchscreen display, a touchscreen controller 20, a battery 22, an audio codec (not shown), a video codec (not shown), a power amplifier 24, a global positioning system (GPS) device 26, a compass 28, an accelerometer (not shown), a gyroscope (not shown), a speaker 30, a camera 32, a microphone array 34, and a mass storage device (such as a hard disk drive) 10, a compact disk (CD) drive (not shown), a digital versatile disk (DVD) drive (not shown), and so forth. These components may be connected to the system board 2, mounted to the system board, or combined with any of the other components.

The communication package 6 enables wireless and/or wired communications for the transfer of data to and from the computing device 100. The term “wireless” and its derivatives may be used to describe circuits, devices, systems, methods, techniques, communications channels, etc., that may communicate data through the use of modulated electromagnetic radiation through a non-solid medium. The term does not imply that the associated devices do not contain any wires, although in some embodiments they might not. The communication package 6 may implement any of a number of wireless or wired standards or protocols, including but not limited to Wi-Fi (IEEE 802.11 family), WiMAX (IEEE 802.16 family), IEEE 802.20, long term evolution (LTE), Ev-DO, HSPA+, HSDPA+, HSUPA+, EDGE, GSM, GPRS, CDMA, TDMA, DECT, Bluetooth, Ethernet, derivatives thereof, as well as any other wireless and wired protocols that are designated as 3G, 4G, 5G, and beyond. The computing device 100 may include a plurality of communication packages 6. For instance, a first communication package 6 may be dedicated to shorter range wireless communications such as Wi-Fi and Bluetooth, and a second communication package 6 may be dedicated to longer range wireless communications such as GPS, EDGE, GPRS, CDMA, WiMAX, LTE, Ev-DO, and others.

The processor 4 of the computing device 100 includes an integrated circuit die packaged within the processor 4. The term “processor” may refer to any device or portion of a device that processes electronic data from registers and/or memory to transform that electronic data into other electronic data that may be stored in registers and/or memory. The processor may be packaged as a system on a chip (SoC) that includes several other devices that are shown as separate devices in the drawing figure.

In various implementations, the computing device 100 may be a laptop, a netbook, a notebook, an ultrabook, a smartphone, a tablet, a personal digital assistant (PDA), an ultra mobile PC, a mobile phone, a desktop computer, a server, a printer, a scanner, a monitor, a set-top box, an entertainment control unit, a digital camera, a portable music player, or a digital video recorder. The computing device may be fixed, portable, or wearable. In further implementations, the computing device 100 may be any other electronic device that processes data.

Embodiments may be implemented as a part of one or more memory chips, controllers, CPUs (Central Processing Unit), microchips or integrated circuits interconnected using a motherboard, an application specific integrated circuit (ASIC), and/or a field programmable gate array (FPGA).

References to “one embodiment”, “an embodiment”, “example embodiment”, “various embodiments”, etc., indicate that the embodiment(s) of the invention so described may include particular features, structures, or characteristics, but not every embodiment necessarily includes the particular features, structures, or characteristics. Further, some embodiments may have some, all, or none of the features described for other embodiments.

In the following description and claims, the term “coupled” along with its derivatives may be used. “Coupled” is used to indicate that two or more elements co-operate or interact with each other, but they may or may not have intervening physical or electrical components between them.

As used in the claims, unless otherwise specified, the use of the ordinal adjectives “first”, “second”, “third”, etc., to describe a common element merely indicates that different instances of like elements are being referred to, and is not intended to imply that the elements so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.

The drawings and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, orders of processes described herein may be changed and are not limited to the manner described herein. Moreover, the actions of any flow diagram need not be implemented in the order shown; nor do all of the acts necessarily need to be performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of embodiments is by no means limited by these specific examples. Numerous variations, whether explicitly given in the specification or not, such as differences in structure, dimension, and use of material, are possible. The scope of embodiments is at least as broad as given by the following claims.

The following examples pertain to further embodiments. The various features of the different embodiments may be variously combined, with some features included and others excluded, to suit a variety of different applications. Some embodiments pertain to a method that includes determining a position of a speaker, selecting a pre-determined filter based on the determined position, receiving audio from the speaker and from other simultaneous audio sources at a microphone array, and applying the selected filter to the received audio to separate the audio from the speaker from the audio from the other simultaneous audio sources.

In further embodiments the audio from the speaker is a spoken command, and the method further includes applying speech recognition to the received command to determine the spoken command. Further embodiments include executing the determined command.

In further embodiments determining a position of the speaker comprises receiving audio from the speaker at a plurality of microphones and comparing delays of the received audio.

In further embodiments determining a position of the speaker comprises observing the speaker with a camera and using the observation to determine the position of the speaker.

In further embodiments selecting a pre-determined filter comprises applying the determined position to a look-up table of different positions to obtain the pre-determined filter.

In further embodiments the other simultaneous audio sources comprise a command spoken by a second speaker, and the method further includes determining a position of the second speaker, selecting a pre-determined filter based on the determined position of the second speaker, and applying the selected filter to the received audio to separate the command from the second speaker from the audio from the first speaker.

In further embodiments applying the selected filter comprises performing a blind source separation on the received audio. In further embodiments the selected filter is a finite impulse response filter. In further embodiments applying the selected filter comprises applying the selected filter in the frequency domain.

Some embodiments pertain to an apparatus with a sensor to determine a position of a speaker, a microphone array to receive audio from the speaker and from other simultaneous audio sources, and a processor to select a pre-determined filter based on the determined position and to apply the selected filter to the received audio to separate the audio from the speaker from the audio from the other simultaneous audio sources.

In further embodiments the sensor comprises a camera. In further embodiments the sensor comprises the microphone array, and the position of the speaker is determined by comparing delays in the received audio at each of the plurality of microphones.

In further embodiments the audio from the speaker is a spoken command, the processor further applying speech recognition to the received command to determine the spoken command and executing the determined command.

Further embodiments include a memory to store a lookup table of different speaker positions, and the processor applies the determined position to the lookup table to obtain the pre-determined filter.

In further embodiments the lookup table is populated by using a plurality of transfer functions for each determined position and by applying each of the transfer functions to a known stored audio reference signal.

In further embodiments the determined position is compared to a plurality of overlapping isolation volumes, and selecting a pre-determined filter comprises selecting a filter corresponding to one of the plurality of isolation volumes.

Some embodiments pertain to a computing system that includes a plurality of cameras to observe a speaker and determine a position of the speaker, a plurality of microphones to receive audio from the speaker and from other simultaneous audio sources, a processor to select a pre-determined filter based on the determined position, and a signal processor to apply the selected filter to the received audio to separate the audio from the speaker from the audio from the other simultaneous audio sources, the processor to apply speech recognition to the received command to determine the spoken command and to execute the determined command.

Further embodiments include a display coupled to the processor to display information in response to executing the command.

In further embodiments the filter is a finite impulse response filter and the signal processor applies blind source separation to separate the audio.

1. (canceled)
2. A method, comprising: receiving, via a microphone array including a plurality of microphones, a plurality of sounds including at least a first speaker sound from a first speaker; determining a time difference between a first time of receipt of the plurality of sounds at a first microphone and a second time of receipt of the plurality of sounds at a second microphone, the first and second microphones included in the microphone array; determining a position of the first speaker based at least in part on the time difference; and determining a first filter based at least in part on the position of the first speaker; wherein the first filter is to separate the first speaker sound from the plurality of sounds.

3. The method of claim 2, comprising: determining a position of a second speaker based at least in part on the time difference; and determining a second filter based at least on the position of the second speaker, wherein: the plurality of sounds includes a second speaker sound from the second speaker; and the second filter is to separate the second speaker sound from the plurality of sounds.

4. The method of claim 2, comprising: transcribing the first speaker sound.

5. The method of claim 2, comprising: storing the first filter in a table; and associating the stored filter with the position of the first speaker.

6. A system, comprising: a microphone array including a plurality of microphones; and one or more processors to execute instructions to: receive, via the microphone array, a plurality of sounds including at least a first speaker sound from a first speaker; determine a time difference between a first time of receipt of the plurality of sounds at a first microphone and a second time of receipt of the plurality of sounds at a second microphone, the first and second microphones included in the microphone array; determine a position of the first speaker based at least in part on the time difference; and determine a first filter based at least in part on the position of the first speaker; wherein the first filter is to separate the first speaker sound from the plurality of sounds.

7. The system of claim 6, wherein: the one or more processors are further to execute instructions to: determine a position of a second speaker based at least in part on the time difference; and determine a second filter based at least on the position of the second speaker; the plurality of sounds includes a second speaker sound from the second speaker; and the second filter is to separate the second speaker sound from the plurality of sounds.

8. The system of claim 6, wherein the one or more processors are further to execute instructions to transcribe the first speaker sound.

9. The system of claim 6, wherein the one or more processors are further to execute instructions to: store the first filter in a table; and associate the stored filter with the position of the first speaker.

10. One or more non-transitory computer-readable storage devices having stored thereon instructions which, when executed by one or more processors, result in operations comprising: receive, via a microphone array including a plurality of microphones, a plurality of sounds including at least a first speaker sound from a first speaker; determine a time difference between a first time of receipt of the plurality of sounds at a first microphone and a second time of receipt of the plurality of sounds at a second microphone, the first and second microphones included in the microphone array; determine a position of the first speaker based at least in part on the time difference; and determine a first filter based at least in part on the position of the first speaker; wherein the first filter is to separate the first speaker sound from the plurality of sounds.

11. The one or more non-transitory computer-readable storage devices of claim 10, wherein the instructions comprise instructions which, when executed by the one or more processors, result in operations comprising: determine a position of a second speaker based at least in part on the time difference; and determine a second filter based at least on the position of the second speaker, wherein: the plurality of sounds includes a second speaker sound from the second speaker; and the second filter is to separate the second speaker sound from the plurality of sounds.

12. The one or more non-transitory computer-readable storage devices of claim 10, wherein the instructions comprise instructions which, when executed by the one or more processors, result in operations comprising: transcribe the first speaker sound.

13. The one or more non-transitory computer-readable storage devices of claim 10, wherein the instructions comprise instructions which, when executed by the one or more processors, result in operations comprising: store the filter in a table; and associate the stored filter with the position of the first speaker.